On the Construction of a Large Scale Chinese Web Test Collection

Authors

  • Hongfei Yan
  • Chong Chen
  • Bo Peng
  • Xiaoming Li
Abstract

The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing; it has been used by several dozen research groups and was adopted in the evaluations of the SEWM-2004 Chinese Web Track [1] and HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection such as CWT100g. We further found that 1) the distribution of the number of pages within sites obeys a Zipf-like law rather than the power law proposed by Adamic and Huberman [3, 4], and 2) an appropriate method for filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host-alias filtering method proposed in the paper should help both in modeling the Web and in improving search engines. Finally, we report the results of the SEWM-2004 Chinese Web Track.
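
As a rough illustration of what checking a Zipf-like relation between sites and their page counts could look like, the sketch below ranks sites by page count and fits a straight line to log(count) against log(rank) by ordinary least squares. This is not the procedure used in the paper, which the abstract does not describe; the function name `fit_rank_size` and the synthetic page counts are assumptions made purely for illustration.

```python
import math


def fit_rank_size(page_counts):
    """Fit log(count) = log(c) - alpha * log(rank) by least squares.

    page_counts: iterable of pages-per-site values (hypothetical input).
    Returns (alpha, c); alpha near 1 would indicate a classic Zipf shape.
    """
    # Rank sites from largest to smallest, ignoring empty sites.
    sizes = sorted((n for n in page_counts if n > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two sites with pages")

    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(n) for n in sizes]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    alpha = -slope                       # decay exponent of the rank-size curve
    c = math.exp(mean_y - slope * mean_x)  # fitted size of the rank-1 site
    return alpha, c


# Synthetic data (not from CWT100g): 50 sites whose sizes follow 1000 / rank.
counts = [1000 / rank for rank in range(1, 51)]
alpha, c = fit_rank_size(counts)
```

On real crawl data the points would scatter around the fitted line, and a straight-line fit on a log-log plot is only a first check; it does not by itself distinguish a Zipf-like law from other heavy-tailed shapes.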

Similar resources

Constructive Dynamisms of Large-Scale Urban Projects by the Space Political Economy Approach; a Case Study of Mashhad Metropolis

Aims: The development of large-scale construction projects has transformed the shape of cities towards specific objectives and based on economic and political perspectives that dominate policy-making and planning in cities. The purpose of the research was to study and analyze the spatiality of Mashhad construction mega-projects and to explain the constructive forces and dynamisms of these proje...

Semantic Constraint and QoS-Aware Large-Scale Web Service Composition

Service-oriented architecture facilitates the running time of interactions by using business integration on the networks. Currently, web services are considered as the best option to provide Internet services. Due to an increasing number of Web users and the complexity of users’ queries, simple and atomic services are not able to meet the needs of users; and to provide complex services, it requ...

Very Large Scale Retrieval and Web Search (Preprint version)

Together, the TREC Very Large Collection (VLC) Track and its successor the Web Track have run for seven years, after an initial VLC pre-track. During that time five new test collections have been created, five different types of retrieval task have been studied, a large number of important issues have been addressed, and new methods have been tried, not only for retrieval, but also for test col...

Chinese-English Parallel Corpus Construction and its Application

Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, and Chinese-English language research and teaching. However, a large-scale Chinese-English corpus is still unavailable, given the difficulties and the intensive labour required. In this paper, our work towards building a large-scale Chinese-Engli...

Centralized Clustering Method To Increase Accuracy In Ontology Matching Systems

Ontology is the main infrastructure of the Semantic Web, providing facilities for integrating, searching and sharing information on the web. The development of ontologies as the basis of the Semantic Web, together with their heterogeneity, has given rise to ontology matching. With the emergence of large-scale ontologies in real domains, ontology matching systems face problems such as memory con...

Publication date: 2008